-
Notifications
You must be signed in to change notification settings - Fork 3
Update Netherlands Recent corpus definition to read from Parlamint v4 data, including Named Entities #1815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks fine! I added some requests for clarity, but other than that, feel free to merge.
| """ | ||
| This file was created as an updated utils file for the ParlaMint dataset, version 4.0. The previous utils file | ||
| is based on version 2.0. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note: module level docstrings should be the first line of the file. If you put it after the imports, it's not registered by help, code editors, etc.
I think by "the previous utils file", you're referring to parliament.py? This reference is ambiguous. Also, someone not familiar with the code history would not know which file came first chronologically.
If I were looking over a directory with parliament.py and parliament_v4.py, I would probably assume that parliament.py covers the most recent version and should be the default for new corpora, and the v4 module is for compatibility with some older version.
| else: | ||
| return False | ||
|
|
||
| def transform_current_party_id(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could do with a docstring. What is data? What is is transformed into?
Based on the name, I assumed this function transformed the current party ID into something else, but based on the code, it looks like this function retrieves the current party ID from other data? If so, the name is a bit misleading.
| def annotated_text_mapping(): | ||
| return {'type': 'annotated_text'} | ||
| def ner_mapping(): | ||
| return {'type': 'text', 'index': False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the issue with the 'annotated_text' type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1724
TLDR: it wasn't necessary because we already have the unannotated text for full-text search, and keyword fields for finding entities.
The current branch uses the
parlamint_v4.pyutils file from #1730. I changed it in the following ways:parliament.utils.parlamint.pyutils